Text book: Vehkalahti, Kimmo & Everitt, Brian S. (2019). Multivariate Analysis for the Behavioral Sciences , Second Edition. Chapman and Hall/CRC, Boca Raton, Florida, USA.
My GitHub repository: https://github.com/hkallo/IODS-project
My GitHub course diary: https://hkallo.github.io/IODS-project/
# Date of submission
date()
## [1] "Fri Dec 1 18:24:54 2023"
How are you feeling right now? Feeling good, looking forward to learning new things in this course. R for Health Data Science is truly great; already after reading the first sections I have learnt several new things that will make my scripts better and more concise. The book has very clear explanations for everything.
What do you expect to learn? I already have some experience with the R statistical software and the basics of statistical analysis. I expect to learn about open data sources and how to utilize them, as well as about GitHub, which is a completely new thing to me. It is also nice to get a recap of the basics of statistics and perhaps find new ways to do things in R.
Where did you hear about the course? I saw an email.
Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1: How did it work as a “crash course” on modern R tools and using RStudio? As I already have some knowledge of R, it was pretty easy and straightforward. However, if I had no experience with this language, I don’t think I would have learnt too much, as the material is detached from actual script writing and nothing really had to be done by yourself. The book seems like a very nice tool for checking how to do things in R; together with the assignments it works even better!
Which were your favorite topics? Which topics were most difficult? My favorite topics were the pipe (%>%) and several parts of the Summarizing data chapter. These helped me understand functions that I had already used, but apparently without understanding them perfectly :) It will be easier to use them in the future. I have also tried R Markdown before, but it will be nice to learn more about it. I did not know about the cheat sheets in R; that was a great tip!
Some other comments on the book and our new approach of getting started with R Markdown etc.? I guess during upcoming weeks we get to do coding ourselves so no other comments:)
# Assignment submitted
date()
## [1] "Fri Dec 1 18:24:54 2023"
library(tidyverse); library(dplyr); library(ggplot2); library(GGally) #load libraries
data <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header = TRUE) #data upload
str(data) #structure of the object
dim(data) #dimensions of the data
Next we will create a new dataset for the analysis containing only the needed information: gender, age, attitude, deep, strategic and surface level question scores, and points.
#save the variables ready in the original data frame to the new analysis data frame
learningAnalysis <- data %>%
select(gender, Age, Attitude, Points)
#find combination variables (deep, strategic and surface level questions) and save each on their own variable
deep <- data %>%
select(starts_with("D")) #NB! starts_with("D") matches more columns than needed
deep <- deep[,1:12] #keep only the 12 deep-question columns
surf <- data %>%
select(starts_with("SU"))
stra <- data %>%
select(starts_with("ST"))
#averaging the answers of selected questions and saving them to the analysis dataset
learningAnalysis$deep <- rowMeans(deep)
learningAnalysis$stra <- rowMeans(stra)
learningAnalysis$surf <- rowMeans(surf)
#scale the combination variable Attitude back to the 1-5 scale by dividing with 10, and delete the old variable
learningAnalysis$attitude <- learningAnalysis$Attitude / 10
learningAnalysis <- subset(learningAnalysis, select = -Attitude)
colnames(learningAnalysis) #check the column names and convert if needed
colnames(learningAnalysis)[2] <- "age"
colnames(learningAnalysis)[3] <- "points"
learningAnalysis <- filter(learningAnalysis, points>0) #include only the rows where points > 0
#reorder the columns:
learningAnalysis <- learningAnalysis %>%
select(gender, age, attitude, deep, stra, surf, points)
#last check
str(learningAnalysis)
head(learningAnalysis)
Now we have a data frame prepared for the subsequent analysis. Let’s save the file to the IODS-project folder.
setwd("C:/Users/Henna/Desktop/IODS/IODS-project") #set working directory to the IODS project folder
getwd() #check that it worked
write_csv(learningAnalysis, file= "learning2014.csv") #save file as csv
setwd("C:/Users/Henna/Desktop/IODS/IODS-project") #set working directory
learningAnalysis <- read.csv("learning2014.csv", header=TRUE) #read file into R
str(learningAnalysis) #structure of the data frame
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learningAnalysis) #first rows of the data frame
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
summary(learningAnalysis) #summary of the variables
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Data description:
Research question: Do students’ gender, age, attitude, and question scores impact success in course completion (points gained)? Learning approaches are measured with questions on several levels (deep, strategic, surface). A set of questions is used to quantify attitude on a scale of 1-5. Points represent the points the students gained from the course. The data also includes information on the students’ age and gender.
Data frame structure: There are 166 observations of 7 variables: gender (character), age (integer), attitude (numeric), deep, strategic and surface question scores (numeric), and points (integer).
The summary table above describes summaries of variables: min, max, mean, median and 1st and 3rd quartiles.
There are 110 women and 56 men in this study.
The students are aged from 17 up to 55.
Attitude scores range between 1.4 and 5.0, the average being 3.14.
The mean scores of deep, strategic and surface questions are 3.68, 3.12 and 2.79, respectively.
The minimum number of points gained is 7 and the maximum 33; the average is 22.72. Half of all scores fall in the range 19.0-27.75 points.
To answer the research question, we will perform regression analysis, a statistical method that describes the relationship between two or more variables.
Graphical overview of the data:
#Function for adding correlation panel
panel.cor <- function(x, y, ...){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
cex.cor <- 0.8/strwidth(txt) #text size scaled by the R value (unused below)
text(0.5, 0.5, txt, cex = 1)} #fixed text size; use cex = cex.cor to scale by R value
#Function for adding regression line
upper_panel_regression_line = function(x,y, ...){
points(x,y,...)
abline(lm(y~x))} #fit the linear model and draw its regression line
my_cols <- c("#00AFBB", "#E7B800") #set colors
learningAnalysis$gender <- as.factor(learningAnalysis$gender) #convert gender to a factor so that colors can be mapped by gender
#Visualization with a scatter plot + regression line matrix, add R values to the lower side of the matrix.
pairs(learningAnalysis[-1], col = my_cols[learningAnalysis$gender],
lower.panel = panel.cor , upper.panel = upper_panel_regression_line)
The scatter plot matrix shows the distribution of the observations (female turquoise, male yellow). The regression lines give an indication of whether there is an association between two variables. For instance, there seems to be a positive dependency between attitude and points: the line goes upwards and is steeper than any other line. The correlation coefficient (R value) for points vs. attitude is 0.44, also suggesting a positive correlation between these variables. If the slope is (close to) horizontal and R is (close to) zero, there is no association between the variables; this is the case, for example, between the deep and points variables. If, in turn, the R value is negative and the slope of the regression line is negative (the line goes downhill), as between the surf and points variables, the association is negative: when the surface question score is higher, the points are likely lower. With this kind of modelling we can thus predict success in the course based on the answers to the preliminary questions. A better explanation of R values and their meaning comes after the next figure.
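As a reminder of what the panel R values mean, the Pearson correlation coefficient is computed with `cor()`. A minimal sketch with made-up toy vectors (not the course data): a positive R means y tends to rise with x.

```r
# Toy illustration of the correlation coefficient shown in the panels
# (invented vectors, not the learning2014 data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 2.5, 3.5, 3.0, 4.5)
round(cor(x, y), 2)  # Pearson correlation; positive and close to 1 here
```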
Let’s check whether running the analysis separately for male and female students reveals differences in the associations of the variables.
# create a more advanced plot matrix with ggpairs()
ggpairs(learningAnalysis, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
Let’s go through the figure. Red = women, Blue = men.
There are about twice as many women as men in this study.
The boxplots reveal that the median age of women is a bit lower than that of men, as is the women’s median attitude score. The strategic and surface question scores, in turn, are slightly higher in women.
The density plots reveal the same. They also show that the distributions of the attitude and surface question scores are quite different between male and female students. The density plot of age shows that the data is right-skewed: there are more young participants than older ones. Age, stra and surf have clearly unimodal distributions (only one peak), while the other variables have 1-2 peaks, sometimes depending on gender. The scatter plots and histograms also show the distributions of the values, but the density plots remain the clearest way to draw conclusions about the distributions.
As we are focused on studying the relationship between the dependent variable ‘points’ and the explanatory variables age, attitude, deep, stra and surf, let’s focus on the rightmost column of correlation coefficients (which measure the linear correlation between two sets of data). Mukaka (2012; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/) presents a rule of thumb for interpreting different coefficient values. The strongest association is between attitude and points, and it is strong in both genders separately. Interestingly, age seems to have a stronger association with points in men than in women. This association is negative, meaning that older age is associated with lower scores. The next largest associations are with the stra and surf variables, but their coefficients are already quite low and not statistically significant.
Multiple regression model
Now we select three explanatory variables to explain the dependent variable “points”. The selection is based on the slopes of the regression lines and the R values in the figures above: the three highest absolute R values are attitude (R = 0.44), stra (R = 0.15) and surf (R = -0.14) (genders pooled in the analysis).
#Multiple regression analysis
# create a regression model with multiple explanatory variables
my_model<- lm(points ~ attitude + stra + surf, data = learningAnalysis)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learningAnalysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The p-value of the F-statistic is 3.156e-08, which is highly significant. This means that at least one of the predictor variables (attitude, stra, surf) is significantly related to the outcome variable.
To identify which predictor variables are significant, let’s examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values. Attitude is significantly associated with points, whereas the variables stra and surf show no association; thus, we can remove these two variables from the model. The coefficient b for attitude is ~3.40, i.e. the average effect on our dependent variable (points) when the predictor (attitude) increases by one unit and all the other predictors stay unchanged.
my_model_attitude<- lm(points ~ attitude, data = learningAnalysis)
# print out a summary of the model
summary(my_model_attitude)
##
## Call:
## lm(formula = points ~ attitude, data = learningAnalysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The p-value of the F-statistic is significant (4.119e-09), and the model tells us that when attitude grows by one unit, points increase by about 3.5 on average. The final equation is: points = 11.6 + 3.5 * attitude.
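As a sanity check, the fitted equation can be applied directly. A small sketch using the rounded coefficients reported above (so the predictions are approximate; the attitude values are examples):

```r
# Predicted points from the simple model: points = 11.6 + 3.5 * attitude
# (rounded coefficients from the summary above)
predict_points <- function(attitude) 11.6 + 3.5 * attitude
predict_points(c(2, 3, 4))  # approximate predictions for attitude scores 2, 3 and 4
```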
Out of curiosity, before proceeding to quality assessment, let’s check whether age should be included in the model when analyzing the explanatory variables of points in men only.
menstudents <- learningAnalysis %>%
filter(gender=="M")
my_model_men<- lm(points ~ attitude + age, data = menstudents)
summary(my_model_men) #summary of the model
##
## Call:
## lm(formula = points ~ attitude + age, data = menstudents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6798 -3.2162 0.5482 2.8711 9.3838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.66207 4.56036 2.996 0.004156 **
## attitude 4.10182 1.10353 3.717 0.000487 ***
## age -0.16050 0.08465 -1.896 0.063418 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.285 on 53 degrees of freedom
## Multiple R-squared: 0.2538, Adjusted R-squared: 0.2256
## F-statistic: 9.013 on 2 and 53 DF, p-value: 0.0004272
Interestingly, this analysis indicates that age might be negatively associated with points in men: the older the man, the fewer the points. The p-value 0.06 is very close to statistical significance (p < 0.05). However, let’s not continue further with this subset but rather analyse men and women together.
Quality assessment
Finally, we should perform quality assessment of the model, based on R-squared (R2) and the Residual Standard Error (RSE). R2 is sensitive to the number of variables included in the model, so the adjusted R2 corrects for the number of explanatory variables in the prediction model. The adjusted R2 = 0.1856, meaning that ~19% of the variance in points can be predicted from the attitude score. If we compare the adjusted R2 to that of the previous model with 3 predictor variables, the values are not very different, meaning that having 3 predictors did not improve the quality of the model.
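The adjustment can be verified by hand from the reported multiple R2. A sketch using the values from the single-predictor summary above (n = 166 observations, p = 1 predictor), which reproduces the reported adjusted R2 up to rounding:

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.1906  # multiple R-squared from the summary above (rounded)
n  <- 166     # number of observations
p  <- 1       # number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)
```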
#error rate
summary(my_model_attitude)$sigma/mean(learningAnalysis$points)
## [1] 0.2341752
The RSE estimate gives a measure of prediction error: the lower the RSE, the more accurate the model. Here we calculated the error rate by dividing the RSE by the mean of the outcome variable; thus an RSE of 5.32 corresponds to a 23% error rate.
The last thing to do is to graphically explore the validity of our model assumptions with the Residuals vs Fitted values, Normal Q-Q, and Residuals vs Leverage plots. A residual is the difference between the observed value and the value that the model predicts for that observation.
# draw diagnostic plots using the plot() function
par(mfrow = c(2,2))
plot(my_model_attitude, which=c(1,2,5))
par(mfrow = c(1, 1)) #reset plotting matrix:
Conclusion: The students’ final course points are positively associated with the attitude scores based on the preliminary questions asked before the course: the higher the attitude score, the more points. Only one explanatory variable is included in the model, as the rest did not reach significance. Quality assessment suggested that our model is reasonably reliable.
We are exploring data from two questionnaires related to student performance in secondary education at two Portuguese schools. The data include student grades and demographic, social and school-related variables. Here we have combined the data sets of the Mathematics and Portuguese language subjects.
Here we study the relationships between alcohol consumption and selected variables.
Data source: http://www.archive.ics.uci.edu/dataset/320/student+performance
#import data
library(readr)
alc<-read_csv("C:/Users/Henna/Desktop/IODS/IODS-project/Data/student_performance_alcohol.csv")
alc <- as.data.frame(alc)
colnames(alc) #column names
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
str(alc) #structure of the dataset
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : num 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : num 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: num 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : num 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
All the variables are listed above. The variables not used for joining the two data sets have been combined by averaging (including the grade variables). Alcohol use (‘alc_use’) is the average of workday and weekend alcohol consumption; if the average is higher than 2, consumption is considered ‘high use’ (‘high_use’ = TRUE). The rest of the variables are described on the website (link given above).
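The derivation described above can be sketched in code. A minimal toy example of how ‘alc_use’ and ‘high_use’ were presumably computed (the Dalc/Walc values here are invented for illustration; base R is used instead of the dplyr pipeline used elsewhere in this diary):

```r
# Toy sketch of the derived alcohol-use variables (invented values)
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 3, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # mean of workday and weekend consumption
toy$high_use <- toy$alc_use > 2            # 'high use' if the average exceeds 2
toy
```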
We have 370 observations and 35 variables in the data frame. There are character, numeric and logical variables.
Next we select 4 interesting variables to further study their relationship with alcohol consumption. Let’s first visualize all the variables:
library(tidyr); library(ggplot2); library(dplyr);
gather(alc) %>% ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_bar(fill="#00AFBB") +
ggtitle("Barplots of all variables")
Below are listed the four chosen factors, with a brief description of each variable and its expected relationship with alcohol consumption.
Next we will see whether we can find support for our hypotheses through numerical and graphical exploration of the data: density or frequency plots (depending on the variable), as well as cross-tabulations and plots to visualize the relationships.
ggplot(alc, aes(x=goout)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("GoOut distribution") +
theme_classic()
# create cross-tabulations and plots to visualize
library(sjPlot)
tab_xtab(var.row = alc$goout, var.col = alc$high_use, title = "Cross-Tab of GoOut & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of GoOut & High alcohol consumption (counts with row percentages):

| goout | high_use: FALSE | high_use: TRUE | Total |
|---|---|---|---|
| 1 | 19 (86.4 %) | 3 (13.6 %) | 22 (100 %) |
| 2 | 82 (84.5 %) | 15 (15.5 %) | 97 (100 %) |
| 3 | 97 (80.8 %) | 23 (19.2 %) | 120 (100 %) |
| 4 | 40 (51.3 %) | 38 (48.7 %) | 78 (100 %) |
| 5 | 21 (39.6 %) | 32 (60.4 %) | 53 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ2=55.574 · df=4 · Cramer’s V=0.388 · p=0.000
plot_xtab(alc$goout, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)
The density plot shows that most of the students have a ‘goout’ score of 2-4. The cross-tabulation shows that the higher the ‘goout’ score, the larger the proportion of observations with high_use = TRUE. This supports our hypothesis: a higher goout value is linked with heavier alcohol consumption.
ggplot(alc, aes(x=absences)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Absences distribution") +
theme_classic()
tab_xtab(var.row = alc$absences, var.col = alc$high_use, title = "Cross-Tab of Absences & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of Absences & High alcohol consumption (counts with row percentages):

| absences | high_use: FALSE | high_use: TRUE | Total |
|---|---|---|---|
| 0 | 50 (79.4 %) | 13 (20.6 %) | 63 (100 %) |
| 1 | 37 (74 %) | 13 (26 %) | 50 (100 %) |
| 2 | 40 (71.4 %) | 16 (28.6 %) | 56 (100 %) |
| 3 | 31 (81.6 %) | 7 (18.4 %) | 38 (100 %) |
| 4 | 24 (68.6 %) | 11 (31.4 %) | 35 (100 %) |
| 5 | 16 (72.7 %) | 6 (27.3 %) | 22 (100 %) |
| 6 | 16 (76.2 %) | 5 (23.8 %) | 21 (100 %) |
| 7 | 9 (75 %) | 3 (25 %) | 12 (100 %) |
| 8 | 14 (70 %) | 6 (30 %) | 20 (100 %) |
| 9 | 5 (45.5 %) | 6 (54.5 %) | 11 (100 %) |
| 10 | 5 (71.4 %) | 2 (28.6 %) | 7 (100 %) |
| 11 | 2 (40 %) | 3 (60 %) | 5 (100 %) |
| 12 | 3 (42.9 %) | 4 (57.1 %) | 7 (100 %) |
| 13 | 1 (50 %) | 1 (50 %) | 2 (100 %) |
| 14 | 1 (14.3 %) | 6 (85.7 %) | 7 (100 %) |
| 16 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 17 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 18 | 1 (50 %) | 1 (50 %) | 2 (100 %) |
| 19 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 20 | 2 (100 %) | 0 (0 %) | 2 (100 %) |
| 21 | 1 (50 %) | 1 (50 %) | 2 (100 %) |
| 26 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 27 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 29 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 44 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 45 | 1 (100 %) | 0 (0 %) | 1 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ2=43.001 · df=25 · Cramer’s V=0.341 · Fisher’s p=0.004
plot_xtab(alc$absences, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)
The density plot shows that most of the students have an absence score of zero (the distribution is right-skewed). The cross-tabulation shows that the more absences there are, the larger the proportion with high alcohol consumption tends to be. This supports our hypothesis: a higher number of absences is linked with heavier alcohol consumption.
ggplot(alc, aes(x=failures)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Failures distribution") +
theme_classic()
tab_xtab(var.row = alc$failures, var.col = alc$high_use, title = "Cross-Tab of failures & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of failures & High alcohol consumption (counts with row percentages):

| failures | high_use: FALSE | high_use: TRUE | Total |
|---|---|---|---|
| 0 | 238 (73.2 %) | 87 (26.8 %) | 325 (100 %) |
| 1 | 12 (50 %) | 12 (50 %) | 24 (100 %) |
| 2 | 8 (47.1 %) | 9 (52.9 %) | 17 (100 %) |
| 3 | 1 (25 %) | 3 (75 %) | 4 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ2=14.304 · df=3 · Cramer’s V=0.197 · Fisher’s p=0.002
plot_xtab(alc$failures, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)
The density plot of failures is also right-skewed, meaning that most of the students pass the class. The cross-tabulation shows that a higher number of failures is linked with a larger proportion of high alcohol consumption. This supports our hypothesis: a higher value of failures is linked with an increase in heavy alcohol consumption.
ggplot(alc, aes(x=romantic)) +
geom_bar(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Romantic relationship status frequencies") +
theme_classic()
tab_xtab(var.row = alc$romantic, var.col = alc$high_use, title = "Cross-Tab of Romantic relationship status & High alcohol consumption", show.row.prc = TRUE)
Cross-Tab of Romantic relationship status & High alcohol consumption (counts with row percentages):

| romantic | high_use: FALSE | high_use: TRUE | Total |
|---|---|---|---|
| no | 173 (68.9 %) | 78 (31.1 %) | 251 (100 %) |
| yes | 86 (72.3 %) | 33 (27.7 %) | 119 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ2=0.286 · df=1 · φ=0.034 · p=0.593
plot_xtab(alc$romantic, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)
The frequency plot shows that about one third of the students are in a romantic relationship, whereas about two thirds are single. The cross-tabulation indicates that even though the percentage of high alcohol use is slightly larger among single students, the difference is not statistically significant (p = 0.593). Thus, this result does not support our hypothesis that there might be an association between the variables ‘romantic’ and ‘high_use’.
We are interested in whether alcohol consumption is associated with the chosen set of student characteristics, and binary logistic regression can tell us the probability of this. We choose binary logistic regression because the outcome variable, ‘high_use’, has two levels (TRUE/FALSE); the explanatory variables can be of other types.
#binary logistic regression
m <- glm(high_use ~ failures + absences + goout + romantic, data = alc, family = "binomial")
summary(m) # a summary of the model
##
## Call:
## glm(formula = high_use ~ failures + absences + goout + romantic,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0472 -0.7510 -0.5266 0.8886 2.3474
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.55115 0.44211 -8.032 9.57e-16 ***
## failures 0.56492 0.22385 2.524 0.011612 *
## absences 0.07762 0.02243 3.461 0.000538 ***
## goout 0.70634 0.11846 5.963 2.48e-09 ***
## romanticyes -0.35224 0.27737 -1.270 0.204111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 382.41 on 365 degrees of freedom
## AIC: 392.41
##
## Number of Fisher Scoring iterations: 4
Let’s go through the output:
First we have the distribution of the deviance residuals.
Next we get the coefficients, their standard errors, the z-statistics, and the associated p-values. Failures, absences and goout are statistically significant; the variable ‘romantic’ is not.
The logistic regression coefficients give the change in the log odds of the outcome (high_use) for a one-unit increase in the predictor variable:
- for every one-unit increase in failures, the log odds of high_use = TRUE increase by 0.56
- for every one-unit increase in absences, the log odds of high_use = TRUE increase by 0.08
- for every one-unit increase in goout, the log odds of high_use = TRUE increase by 0.71
From the results we can construct the logistic regression equation (leaving out the statistically non-significant variable):
log(odds[high_use=yes]) = -3.55115 + 0.56492 * failures + 0.07762 * absences + 0.70634 * goout
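The equation can be turned into a probability with the logistic function. A sketch using the coefficients above (note that these come from the four-variable model with ‘romantic’ merely dropped from the equation, so the probabilities are approximate; the example student’s values are invented):

```r
# Sketch: log-odds from the equation above, converted to a probability
# (coefficients copied from the model summary; input values are invented)
log_odds <- function(failures, absences, goout)
  -3.55115 + 0.56492 * failures + 0.07762 * absences + 0.70634 * goout
plogis(log_odds(failures = 0, absences = 4, goout = 5))  # plogis(x) = 1/(1 + exp(-x))
```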
Next we will compute the odds ratios (OR) and confidence intervals (CI):
coef(m) # the coefficients of the model
## (Intercept) failures absences goout romanticyes
## -3.55114961 0.56492346 0.07762078 0.70634191 -0.35223716
OR <- coef(m) %>% exp # compute odds ratios (OR)
CI <- confint(m) %>% exp # compute confidence intervals (CI)
cbind(OR, CI) # print out the odds ratios with their confidence intervals
## OR 2.5 % 97.5 %
## (Intercept) 0.02869164 0.01165358 0.06618211
## failures 1.75931312 1.14014676 2.75402869
## absences 1.08071276 1.03551949 1.13208308
## goout 2.02656433 1.61539789 2.57288191
## romanticyes 0.70311335 0.40369796 1.20112415
We can conclude the following:
- for a one-unit increase in failures, the odds of high alcohol consumption increase by a factor of 1.76 (the outcome is 76% more likely)
- for a one-unit increase in absences, the odds of high alcohol consumption increase by a factor of 1.08 (the outcome is 8% more likely)
- for a one-unit increase in goout, the odds of high alcohol consumption increase by a factor of 2.03 (the odds of the outcome roughly double)
The CI estimates the precision of the OR: a wide CI indicates low precision, whereas a narrow CI indicates higher precision.
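To illustrate where such intervals come from, a Wald-type 95% CI can be computed by hand from the estimate and its standard error; a sketch for ‘failures’ (note that `confint()` above uses profile likelihood, so its bounds differ slightly from this normal approximation):

```r
# Wald 95% CI for the odds ratio of 'failures'
# (estimate and SE copied from the model summary above)
est <- 0.56492
se  <- 0.22385
exp(est + c(lower = -1.96, OR = 0, upper = 1.96) * se)
```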
These results support our hypotheses about the effects of failures, absences and goout, but not about the effect of romantic relationship status.
Next, we will explore the predictive power of the model. We will include only the statistically significant variables: failures, absences and goout.
m_pred <- glm(high_use ~ failures + absences + goout, data = alc, family = "binomial")
probabilities <- predict(m_pred, type = "response") # predict the probability of high_use
alc <- mutate(alc, probability = probabilities) # add the predicted probabilities to 'alc'
alc <- mutate(alc, prediction = probability > 0.5) # use the probabilities to make a prediction of high_use
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, goout, high_use, probability, prediction) %>% tail(10)
## failures absences goout high_use probability prediction
## 361 0 3 3 FALSE 0.21435314 FALSE
## 362 1 0 2 FALSE 0.15791281 FALSE
## 363 1 7 3 TRUE 0.38937775 FALSE
## 364 0 1 3 FALSE 0.19036196 FALSE
## 365 0 6 3 FALSE 0.25431749 FALSE
## 366 1 2 2 FALSE 0.17871716 FALSE
## 367 0 2 4 FALSE 0.33847888 FALSE
## 368 0 3 1 FALSE 0.06266341 FALSE
## 369 0 4 5 TRUE 0.54534711 TRUE
## 370 0 2 1 TRUE 0.05843362 FALSE
table(high_use = alc$high_use, prediction = alc$prediction) # tabulate the target variable versus the predictions
## prediction
## high_use FALSE TRUE
## FALSE 237 22
## TRUE 66 45
ggplot(alc, aes(x = probability, y = high_use, col = prediction))+ #plot of 'high_use' versus 'probability' in 'alc'
geom_point()+
ggtitle("Plotted confusion matrix results ")
The printout of the data frame shows the new columns: the predicted probabilities and the resulting prediction of high_use.
A confusion matrix visualizes and summarizes the performance of a classification algorithm:
- true negatives: 237
- true positives: 45
- false positives (type I error): 22
- false negatives (type II error): 66
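The counts above can be summarized into standard classification metrics; a small sketch using those numbers:

```r
# Classification metrics from the confusion matrix above
tn <- 237; fp <- 22; fn <- 66; tp <- 45
accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Note that the model catches fewer than half of the actual high users (sensitivity ~0.41) while rarely flagging non-users incorrectly (specificity ~0.92).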
Next, let’s compute the proportion of incorrect predictions. The mean of incorrectly classified observations can be thought of as a penalty (loss) function for the classifier: the less penalty, the better.
table(high_use = alc$high_use, prediction = alc$prediction) %>% # tabulate the target variable versus the predictions
prop.table() %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64054054 0.05945946 0.70000000
## TRUE 0.17837838 0.12162162 0.30000000
## Sum 0.81891892 0.18108108 1.00000000
loss_func <- function(class, prob) { # define a loss function (mean prediction error)
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability) #compute the average number of wrong predictions in the (training) data
## [1] 0.2378378
This analysis revealed that the proportion of wrong predictions in the training data is ~24%.
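For context, it helps to compare this loss to a trivial baseline that always predicts FALSE; such a classifier is wrong exactly as often as high_use is TRUE, i.e. 30% of the time in this data. A minimal sketch with a hypothetical toy vector:

```r
loss_func <- function(class, prob) { # same loss function as above
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# hypothetical toy data: 3 of 10 observations are TRUE
toy_class <- c(rep(TRUE, 3), rep(FALSE, 7))
loss_func(class = toy_class, prob = rep(0, 10))  # constant FALSE prediction -> 0.3
```

So the fitted model's ~24% error is a modest improvement over always guessing the majority class (30% error in this data).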
Now we continue to cross-validation of the model. Below we compute both leave-one-out cross-validation (K equal to the number of observations) and 10-fold cross-validation.
# K-fold cross-validation
library(boot)
# leave-one-out cross-validation (K = number of observations)
cv_train <- cv.glm(data = alc, cost = loss_func, glmfit = m_pred, K = nrow(alc))
# 10-fold cross-validation
cv_test <- cv.glm(data = alc, cost = loss_func, glmfit = m_pred, K = 10)
# average number of wrong predictions
cv_train$delta[1]
## [1] 0.2405405
cv_test$delta[1]
## [1] 0.2432432
Comparing the average proportions of wrong predictions from the two cross-validations, we see that the error is about the same (~24%). The error is slightly smaller than in the exercise set, meaning that a model including failures, absences and goout predicts high alcohol use a bit better than the model used in the exercise set.
Data source: MASS package in R: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
Literature: Part IV of “MABS4IODS” (chapters 12, 17 & 18)
Important concepts:
library(MASS); library(tidyr); library(corrplot); library(dplyr); library(ggplot2) #load libraries
rm(list=ls())
data("Boston") #load the data
Let’s take a look of the Boston data:
str(Boston) #structure of the dataset
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
With this data set we can explore connections between economic, environmental, societal and cultural features.
The descriptions of the variables are listed in here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.
Let’s explore the summary statistics of the multivariate Boston data:
– each of the variables separately
– relationships between the variables (correlations)
summary(Boston) #summary of each variable separately
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The scatterplot of each variable pair can be used as a first indicative visualization of associations between the variables. In addition, we print an illustrative correlation matrix visualization, which shows how strong the association between each variable pair is, and whether the association is positive or negative.
pairs(Boston,
col = "cornflowerblue",
main = "Scatterplots for each variable-combination of Boston data frame")
cor_matrix <- cor(Boston) %>% #correlation matrix
round(digits=2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6) ##visualize the correlations
The crime rate seems to be slightly positively associated with the proportion of owner-occupied units built prior to 1940 and with lower status of the population. Also, as accessibility to radial highways improves and the property-tax rate increases, the per capita crime rate increases.
Not too surprisingly, there is an association between industry and nitrogen oxides concentration. Moreover, a higher nitrogen oxides concentration seems to be associated with a higher proportion of owner-occupied units built prior to 1940 and with lower population status. The scatterplots indicate that the air pollution and industry variables are in turn negatively associated with the distance to the Boston employment centres and with the median value of owner-occupied homes.
A lower number of rooms per dwelling seems to be linked with lower population status. As expected, the average number of rooms is positively associated with the median value of owner-occupied homes. Lower status of the population is clearly linked with a reduced median value of owner-occupied homes.
Moreover, there is a negative association between the proportion of owner-occupied units built prior to 1940 and the distance to the employment centres, and a positive correlation between accessibility to radial highways and the full-value property-tax rate per $10,000.
The variables are on very different scales. We will make them more comparable by standardizing the dataset.
boston_scaled <- scale(Boston) # center and standardize variables
summary(boston_scaled) # summaries of the scaled variables
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
When scaling the data, we subtract the column mean from each column and divide the difference by the column's standard deviation. This is why the mean of every scaled variable is 0. After scaling (centering and standardizing) we can better compare the variables with each other. This website briefly discusses the whys and whens of centering and scaling variables: https://www.goldsteinepi.com/blog/thewhyandwhenofcenteringcontinuouspredictorsinregressionmodeling/index.html
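The scaling step can be verified by hand: for any column, scale() gives the same result as subtracting the mean and dividing by the standard deviation. A small check on the crim column:

```r
library(MASS)
data("Boston")
# manual standardization: center by the mean, divide by the sd
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
auto   <- scale(Boston)[, "crim"]  # the same column from the scaled matrix
all.equal(unname(auto), manual)    # TRUE
```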
“A further question that is often of interest for grouped multivariate data is whether or not it is possible to use the measurements made to construct a classification rule derived from the original observations (the training set) that will allow new individuals having the same set of measurements (the test sample).” -Part IV of “MABS4IODS” -> discriminant function analysis (chapter 18)
Now we will perform linear discriminant analysis (LDA): the idea is to find a linear combination of features that characterizes or separates two or more classes of objects or events. When we want to use a statistical method to predict something, it is important to have data with which to test how well the predictions fit. Splitting the original data into train and test sets allows us to check how well our model works.
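As an aside, this fit/predict/evaluate pattern with MASS::lda can be sketched on R's built-in iris data (a hypothetical stand-in, not the Boston data used below):

```r
library(MASS)  # for lda()
set.seed(42)
idx   <- sample(nrow(iris), size = 0.8 * nrow(iris))  # 80% train split
fit   <- lda(Species ~ ., data = iris[idx, ])          # fit LDA on the train set
preds <- predict(fit, newdata = iris[-idx, ])$class    # classify the test set
mean(preds == iris$Species[-idx])  # share of correct test-set classifications
```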
Our target variable is the crime rate per capita by town; the rest of the variables are predictors. Our interest lies in deriving a classification rule that uses measurements of the predictor variables to place an observation into one of the categories of the crim variable.
We start by converting the crim variable to a categorical one, dividing the data into four categories: low, med_low, med_high and high crime rate per capita.
class(boston_scaled) # class of the boston_scaled object
## [1] "matrix" "array"
boston_scaled<-as.data.frame(boston_scaled) # change the object to data frame
#Create a factor variable from numerical: binning by quantiles; variable `crim` (per capita crime rate by town)
summary(boston_scaled$crim) #summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
bins<-quantile(boston_scaled$crim) #save quantiles, bin limits, in 'bins'
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Next we’ll divide the data into train (80% of the data) and test sets.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data; save for later to calculate how well the model performed in prediction
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
And finally we perform the analysis and plot the results:
set.seed(123)
lda.fit <- lda(crime~., data = train)
lda.fit # print the lda.fit object
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2623762 0.2524752 0.2376238
##
## Group means:
## zn indus chas nox rm age
## low 1.0670560 -0.9205618 -0.11484506 -0.9079839 0.4423969 -0.9538539
## med_low -0.1130765 -0.2441408 -0.01233188 -0.5414604 -0.1475648 -0.3472909
## med_high -0.3762641 0.1564954 0.22945822 0.3209185 0.1418128 0.4113147
## high -0.4872402 1.0172418 -0.06727176 1.0680511 -0.4586557 0.7833759
## dis rad tax ptratio black lstat
## low 0.9467169 -0.6775274 -0.7199432 -0.44459945 0.37888625 -0.7702864949
## med_low 0.2961775 -0.5506543 -0.4464689 -0.05528293 0.31532521 -0.1276760444
## med_high -0.3406808 -0.4132674 -0.3090720 -0.27200081 0.09400468 -0.0003191301
## high -0.8354530 1.6368728 1.5131579 0.77931510 -0.82222326 0.8725008995
## medv
## low 0.534754595
## med_low -0.005926261
## med_high 0.190441185
## high -0.718239451
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 8.865764e-02 0.769129355 -0.8441691517
## indus -8.773471e-03 -0.166808275 0.2151046484
## chas -9.060755e-02 -0.075092458 0.0472380273
## nox 4.674199e-01 -0.578236892 -1.4164048073
## rm -1.080262e-01 -0.080887276 -0.1824442143
## age 1.935988e-01 -0.344770321 -0.3097046168
## dis -4.492858e-02 -0.165113766 -0.0935157191
## rad 3.144849e+00 0.991686927 0.0007720325
## tax 6.750153e-05 -0.127063632 0.5903339453
## ptratio 9.937571e-02 0.007799096 -0.2798178374
## black -1.442695e-01 0.026234323 0.1668073665
## lstat 2.020085e-01 -0.187879456 0.2899838916
## medv 1.730222e-01 -0.355180840 -0.2962872788
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9437 0.0426 0.0137
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch = classes)
lda.arrows(lda.fit, myscale = 1.2)
Next, we use predict() to classify the LDA-transformed testing data. Finally, we calculate the accuracy of the LDA model by comparing the predicted classes with the true classes.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 15 1 0
## med_low 3 14 3 0
## med_high 0 7 16 1
## high 0 0 0 31
Let’s calculate the accuracy of the prediction: (11 + 14 + 16 + 31) / 102 = 72/102 ≈ 71% of the predictions are correct (≈29% of the observations are misclassified).
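The same accuracy can be computed programmatically from any cross-tabulation by summing its diagonal; below the counts are hard-coded from the table above:

```r
# rows = correct class, columns = predicted class (counts from the printout)
conf <- matrix(c(11, 15,  1,  0,
                  3, 14,  3,  0,
                  0,  7, 16,  1,
                  0,  0,  0, 31),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # overall accuracy: 72/102, about 0.71
```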
Cluster analysis is a generic term for a wide range of numerical methods with the common goal of uncovering or discovering groups or clusters of observations that are homogeneous and separated from other groups. Clusters are identified by the assessment of the relative distances between points. Clustering means that some points (or observations) of the data are in some sense closer to each other than some other points.
Clustering starts with calculating an inter-individual distance (or similarity) matrix. There are many ways to calculate distances or similarities between pairs of individuals; here we use the Euclidean distance. After calculating the distances, we proceed to run the k-means algorithm.
K-means clustering is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects.
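The Euclidean distance mentioned above is just the square root of the summed squared coordinate differences; a toy check that dist() agrees with the formula:

```r
# two toy points: (0, 0) and (3, 4)
m <- rbind(a = c(0, 0), b = c(3, 4))
d <- dist(m, method = "euclidean")  # lower-triangle distance object
as.numeric(d)                       # sqrt(3^2 + 4^2) = 5
```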
data("Boston") #load the data again
boston_scaled_2 <- scale(Boston) # scaling variables
boston_scaled_2<-as.data.frame(boston_scaled_2) # change the object to data frame
dist_eu <- dist(boston_scaled_2) # euclidean distance matrix
summary(dist_eu) # look at the summary of the distances
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# k-means clustering
km <- kmeans(boston_scaled_2, centers = 3) #centers = number of clusters
# plot the Boston dataset with clusters
pairs(boston_scaled_2, col = km$cluster)
The summary table of the Euclidean distances shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum of the distances between pairs of points.
Let’s determine the optimal number of clusters. When plotting the number of clusters against the total WCSS (within-cluster sum of squares), the optimal number of clusters is where the total WCSS drops radically.
#K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled_2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The plot above represents the variance within the clusters. It decreases as k increases, but a bend can be seen at k = 6. This bend indicates that additional clusters beyond the sixth add little value. In the next section, we’ll classify the observations into 6 clusters. (more info: https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/)
km <- kmeans(boston_scaled_2, centers = 6) # k-means clustering
print(km)
## K-means clustering with 6 clusters of sizes 49, 47, 34, 78, 124, 174
##
## Cluster means:
## crim zn indus chas nox rm age
## 1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554 1.5399833 0.07933756
## 2 -0.2834643 -0.4872402 1.5965043 -0.2723291 1.0453673 -0.6245590 0.94375129
## 3 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681 0.37213224
## 4 -0.4126954 1.9952361 -1.1032525 -0.2218534 -1.1552982 0.6080657 -1.40363754
## 5 1.1156495 -0.4872402 1.0149946 -0.2723291 0.9916473 -0.4276828 0.75159525
## 6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
## dis rad tax ptratio black lstat
## 1 -0.2868574 -0.520140581 -0.82627237 -1.0314738 0.35523433 -0.9636997
## 2 -0.8963217 -0.622669809 0.03772813 -0.2084479 -0.08717824 0.7185475
## 3 -0.4033382 0.001081444 -0.09756330 -0.3924585 0.17154271 -0.1643525
## 4 1.5772490 -0.627024335 -0.57588304 -0.7161406 0.35335146 -0.9028481
## 5 -0.8170870 1.659602895 1.52941294 0.8057784 -0.81154619 0.9129958
## 6 0.2769540 -0.587168164 -0.59021294 0.1702603 0.30993538 -0.1365045
## medv
## 1 1.6147330
## 2 -0.5152915
## 3 0.5733409
## 4 0.6621944
## 5 -0.7713403
## 6 -0.1747229
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 6 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 4 6 6 6 6 6 6 6 6 6 6 6 4 4 4 4 4 4 4 6
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 6 6 6 4 4 4 4 6 6 6 6 6 6 6 6 6 6 6 6 6
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 4 6 6 6 6 6 6 6 1 1 6 6 6 6 6 6 6 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 2 3 2 2 2 2 2 2 2 2 2 3 2 3 3 2 1 2 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 3 1 3 3 2 2 1 2 2 2 2 2 6 6 6 1 6 6 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 1 1 1 6 6 1 4 4 4 4 4 4 4 4 4 4 4 4 4
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 4 4 4 4 4 6 6 6 3 3 3 3 3 6 6 6 3 1 3 3
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 3 3 3 1 1 1 1 1 1 1 6 1 1 1 3 6 3 1 4 4
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 4 6 4 4 6 6 4 6 6 4 4 4 4 4 4 4 4 1 1 1
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 1 1 1 1 1 6 1 1 1 3 6 6 6 3 3 4 3 3 4 1
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 1 1 3 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 4 4
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 4 4 4 4 1 6 1 1 6 6 6 6 6 6 6 6 6 6 6 6
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 6 6 6 6 6 6 6 6 6 6 6 4 4 6 6 6 6 6 6 6
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 6 4 6 4 4 6 6 4 4 4 4 4 4 4 4 4 3 3 3 5
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 5 5 5 3 3 5 5 5 5 3 3 5 3 5 5 5 5 5 5 5
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 5 5 5 5 5 5 5 5 2 2 2 2 2 6 6 6 6 6 6 6
## 501 502 503 504 505 506
## 6 6 6 6 6 6
##
## Within cluster sum of squares by cluster:
## [1] 209.5683 263.8282 340.7321 363.4559 984.9444 590.5735
## (between_SS / total_SS = 61.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Let’s visualize the clusters with the pairs() function:
pairs(boston_scaled_2, col = km$cluster)
data("Boston")
boston_scaled_bonus <- scale(Boston) # scaling
boston_scaled_bonus<-as.data.frame(boston_scaled_bonus) # change the object to data frame
#clusters: km$cluster
#LDA
set.seed(123)
lda.fit_bonus <- lda(km$cluster~., data = boston_scaled_bonus)
lda.fit_bonus # print the lda.fit object
## Call:
## lda(km$cluster ~ ., data = boston_scaled_bonus)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6
## 0.09683794 0.09288538 0.06719368 0.15415020 0.24505929 0.34387352
##
## Group means:
## crim zn indus chas nox rm age
## 1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554 1.5399833 0.07933756
## 2 -0.2834643 -0.4872402 1.5965043 -0.2723291 1.0453673 -0.6245590 0.94375129
## 3 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681 0.37213224
## 4 -0.4126954 1.9952361 -1.1032525 -0.2218534 -1.1552982 0.6080657 -1.40363754
## 5 1.1156495 -0.4872402 1.0149946 -0.2723291 0.9916473 -0.4276828 0.75159525
## 6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
## dis rad tax ptratio black lstat
## 1 -0.2868574 -0.520140581 -0.82627237 -1.0314738 0.35523433 -0.9636997
## 2 -0.8963217 -0.622669809 0.03772813 -0.2084479 -0.08717824 0.7185475
## 3 -0.4033382 0.001081444 -0.09756330 -0.3924585 0.17154271 -0.1643525
## 4 1.5772490 -0.627024335 -0.57588304 -0.7161406 0.35335146 -0.9028481
## 5 -0.8170870 1.659602895 1.52941294 0.8057784 -0.81154619 0.9129958
## 6 0.2769540 -0.587168164 -0.59021294 0.1702603 0.30993538 -0.1365045
## medv
## 1 1.6147330
## 2 -0.5152915
## 3 0.5733409
## 4 0.6621944
## 5 -0.7713403
## 6 -0.1747229
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## crim -0.0305347753 -0.08591377 -0.108258758 0.01370669 0.12793637
## zn -0.2349745638 -0.08924451 -0.895982678 -1.47547778 0.34969495
## indus 0.1256678211 -0.64511317 1.388877801 -1.77654417 0.39972164
## chas 5.8325152872 0.24741612 -0.311207345 -0.07443844 -0.27828298
## nox -0.0025874888 0.27316108 0.279998181 -0.33251185 0.43887081
## rm -0.0806099076 -0.01151686 -0.148388969 0.12537565 0.56954956
## age 0.0721570288 -0.01868907 0.362844031 0.22642315 0.24959116
## dis 0.0698432704 0.27583463 -0.218044854 -0.54697937 -0.14293536
## rad 0.2608126124 -3.31540916 -1.507829403 1.14007512 0.04287448
## tax -0.0049952119 0.23295413 -0.381672532 -0.48815038 -0.21073890
## ptratio -0.0007726087 0.08817263 0.041915486 -0.04862072 -0.32527265
## black 0.0078197790 0.09331354 -0.003179723 0.02328372 -0.03306096
## lstat -0.1245078745 -0.13037037 -0.008198020 -0.03449407 0.25315537
## medv -0.1868981610 0.34078428 0.026517982 0.20032076 0.88234880
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5
## 0.6255 0.2484 0.0727 0.0387 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes_bonus <- as.numeric(km$cluster)
# plot the lda results
plot(lda.fit_bonus, dimen = 2, col=classes_bonus, pch = classes_bonus)
lda.arrows(lda.fit_bonus, myscale = 1.2)
The most influential linear separators for the clusters are chas, rad & indus.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
The 3D plot helps to separate clusters that overlap on two axes.
###FIN###
Data info:
Literature: Part IV of “MABS4IODS” (chapters 13 & 14)
Dimensionality reduction techniques
PCA
MCA
In this exercise we will study the Human Development Index (HDI) dataset, which was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone.
library(readr); library(dplyr); library(tibble); library(GGally); library(corrplot); library(ggplot2); library(factoextra); library(FactoMineR); library(tidyr); # load libraries
rm(list=ls()) # clear variables
getwd()
setwd("C:/Users/Henna/Desktop/IODS/IODS-project/Data") # set working directory
human <- read_csv("C:/Users/Henna/Desktop/IODS/IODS-project/Data/human.csv") # load the data
str(human)
The data set consists of 155 observations and 9 variables.
Variables of the dataset:
Next we will calculate summaries of the variables…
# move the country names to rownames
human_ <- column_to_rownames(human, "Country")
# summaries of the variables
summary(human_)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Overall, this summary indicates that there is large variability between the countries studied.
… and explore the data and the relations between the variables graphically.
ggpairs(human_, progress=FALSE) # visualize the 'human_' variables
cor(human_) %>% # compute the correlation matrix and visualize it with corrplot
corrplot()
Next, we perform principal component analysis (PCA) on the raw (non-standardized) human data, explore the variability and visualize the result with a biplot, with arrows representing the original variables.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.5, 1), col = c("grey40", "deeppink2"))
The Gross National Income (GNI) per capita seems to dominate the first principal component by far. However, we remember from the summary inspection that the variability of GNI was massively larger than that of any other variable. This most likely hides the effects of the other variables.
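This dominance effect is easy to reproduce: in PCA on unscaled data, a variable with a much larger variance captures the first component almost entirely. A hypothetical two-variable example:

```r
set.seed(1)
x <- data.frame(small = rnorm(100),             # sd ~ 1
                big   = rnorm(100, sd = 1000))  # sd ~ 1000
p <- prcomp(x)                 # no standardization
abs(p$rotation["big", "PC1"])  # close to 1: PC1 is essentially the 'big' variable
```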
Let’s proceed with standardizing the variables in the human data and repeat the above analysis.
human_std <- scale(human_) # standardize the variables
summary(human_std) # print out summaries of the standardized variables
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_human <- prcomp(human_std) # perform principal component analysis (with the SVD method)
s <- summary(pca_human) # create and print out a summary of pca_human
pca_pr <- round(s$importance[2, ], digits = 3) * 100 # rounded percentages of variance captured by each PC
print(pca_pr) # print out the percentages of variance
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)") # create object pc_lab to be used as axis labels
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2],
xlim=c(-.4, .4),
main='PCA - Results 1',
sub='Principal components 1 and 2 explain 69.8% of the total variance between the countries.',
expand=1.2)
#fviz_cos2(pca_human, choice = "var", axes = 1:2) #The goal of the visualization is to see how much each variable is represented in a given component. Such a quality of representation is called the Cos2 and corresponds to the square cosine, and it is computed using the fviz_cos2 function.
fviz_pca_var(pca_human, col.var = "cos2",
gradient.cols = c("black", "orange", "green"),
repel = TRUE,
title='PCA - Results 2')
#http://agroninfotech.blogspot.com/2020/06/biplot-for-pcs-using-base-graphic.html#add-axis-title-and-labels
Summaries of the standardized variables reveal that the value ranges are now on the same scale and the means are centered to zero. Standardization ensures that each variable contributes at the same level, preventing one variable from dominating the others. The results and interpretations are very different between the standardized and non-standardized data sets. Always inspect the data carefully before drawing final conclusions!
Next we have printed out the percentages of variance captured by each principal component. The first principal component explains ~53.6% of the total variance and the second 16.2%; together they cover 69.8% of it. (https://www.datacamp.com/tutorial/pca-analysis-r)
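These percentages come from the squared standard deviations (the eigenvalues) of the components: each PC's share is sdev^2 divided by the total. A check on R's built-in USArrests data:

```r
pca  <- prcomp(USArrests, scale. = TRUE)
prop <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per PC
all.equal(round(prop, 5),
          unname(summary(pca)$importance[2, ]))  # TRUE
```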
With a biplot we visualize the similarities and dissimilarities between the samples, and the impact of each attribute on each of the principal components.
In this study 300 individuals were asked how they drink tea (18 questions) and about their perceptions of the product (12 questions). On top of this, 4 questions on personal details were asked.
In this task we practice the MCA (multiple correspondence analysis) dimensionality reduction technique. This approach is similar to PCA, but adjusted for discrete (categorical) variables.
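Conceptually, MCA operates on the indicator (dummy) matrix of the categorical variables, with one 0/1 column per category level. A minimal base-R sketch of that coding, using a hypothetical toy factor (not the actual tea data):

```r
# hypothetical toy factor resembling the 'Tea' variable
tea_type <- factor(c("black", "Earl Grey", "green", "black"))
# build the indicator matrix: one 0/1 column per level
ind <- sapply(levels(tea_type), function(l) as.integer(tea_type == l))
rowSums(ind)  # each row sums to 1: every observation has exactly one level
```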
Load the tea dataset and convert its character variables to factors. Explore the data briefly:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
#View(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
The tea data consists of 300 observations and 36 variables. All of the variables except age are factors with 2-7 levels; age is an integer.
Next, we will visualize the variables:
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "spirituality", "healthy", "frequency")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, dplyr::all_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where spirituality healthy
## chain store :192 Not.spirituality:206 healthy :210
## chain store+tea shop: 78 spirituality : 94 Not.healthy: 90
## tea shop : 30
##
## frequency
## +2/day :127
## 1 to 2/week: 44
## 1/day : 95
## 3 to 6/week: 34
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free")+
geom_bar()+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Next we will use Multiple Correspondence Analysis (MCA) on the selected columns of the tea data. With this analysis we aim to detect groups of individuals with similar profiles in their answers, as well as associations between variable categories. (http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/)
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.221 0.205 0.177 0.153 0.144 0.136 0.127
## % of var. 11.775 10.932 9.462 8.173 7.677 7.279 6.760
## Cumulative % of var. 11.775 22.707 32.169 40.342 48.019 55.298 62.058
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.116 0.107 0.105 0.101 0.091 0.082 0.063
## % of var. 6.207 5.717 5.616 5.400 4.857 4.348 3.336
## Cumulative % of var. 68.265 73.982 79.598 84.998 89.856 94.204 97.540
## Dim.15
## Variance 0.046
## % of var. 2.460
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.390 0.229 0.134 | 0.057 0.005 0.003 | -0.417
## 2 | -0.248 0.093 0.041 | -0.091 0.013 0.005 | -0.786
## 3 | -0.184 0.051 0.048 | -0.058 0.005 0.005 | -0.398
## 4 | -0.570 0.490 0.314 | -0.153 0.038 0.023 | 0.209
## 5 | -0.238 0.085 0.049 | -0.056 0.005 0.003 | 0.087
## 6 | -0.409 0.252 0.209 | -0.033 0.002 0.001 | -0.359
## 7 | -0.211 0.067 0.030 | 0.049 0.004 0.002 | -0.105
## 8 | -0.346 0.181 0.061 | 0.129 0.027 0.008 | -0.635
## 9 | 0.499 0.377 0.138 | -0.520 0.440 0.150 | 0.108
## 10 | 0.814 1.000 0.449 | -0.389 0.245 0.102 | -0.467
## ctr cos2
## 1 0.326 0.154 |
## 2 1.162 0.408 |
## 3 0.298 0.227 |
## 4 0.082 0.042 |
## 5 0.014 0.007 |
## 6 0.242 0.161 |
## 7 0.021 0.007 |
## 8 0.757 0.204 |
## 9 0.022 0.007 |
## 10 0.410 0.148 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.542 4.099 0.096 5.360 | 0.145 0.315 0.007
## Earl Grey | -0.212 1.634 0.081 -4.918 | -0.200 1.565 0.072
## green | 0.024 0.004 0.000 0.144 | 0.844 4.776 0.088
## alone | -0.080 0.235 0.012 -1.885 | 0.188 1.401 0.066
## lemon | 0.568 2.007 0.040 3.451 | 0.035 0.008 0.000
## milk | -0.230 0.631 0.014 -2.055 | -0.367 1.727 0.036
## other | 1.265 2.717 0.049 3.846 | -1.633 4.878 0.082
## tea bag | -0.653 13.661 0.557 -12.903 | -0.039 0.052 0.002
## tea bag+unpackaged | 0.737 9.642 0.248 8.611 | -0.658 8.271 0.198
## unpackaged | 1.156 9.086 0.182 7.384 | 1.901 26.449 0.493
## v.test Dim.3 ctr cos2 v.test
## black 1.431 | -0.858 12.784 0.241 -8.486 |
## Earl Grey -4.638 | 0.427 8.274 0.329 9.922 |
## green 5.129 | -0.575 2.566 0.041 -3.498 |
## alone 4.431 | -0.085 0.335 0.014 -2.014 |
## lemon 0.214 | 1.314 13.391 0.214 7.990 |
## milk -3.274 | -0.242 0.864 0.016 -2.154 |
## other -4.965 | -1.277 3.445 0.050 -3.882 |
## tea bag -0.767 | -0.186 1.388 0.045 -3.687 |
## tea bag+unpackaged -7.685 | 0.239 1.260 0.026 2.791 |
## unpackaged 12.139 | 0.257 0.557 0.009 1.638 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.101 0.109 0.335 |
## How | 0.099 0.131 0.256 |
## how | 0.572 0.570 0.045 |
## sugar | 0.116 0.000 0.296 |
## where | 0.626 0.660 0.087 |
## spirituality | 0.001 0.038 0.146 |
## healthy | 0.016 0.038 0.138 |
## frequency | 0.234 0.094 0.115 |
# To visualize the percentages of inertia explained by each MCA dimension
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 12), title="Percentages")
# visualize MCA with the biplot of individuals and variable categories:
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali", title="MCA Biplot")
fviz_mca_var(mca, col.var = "cos2",
gradient.cols = c("black", "orange", "green"),
repel = TRUE,
title='MCA - Results')
fviz_ellipses(mca, c("where", "how"),
geom = "point")
The summary output reveals that Dim 1 explains 11.775% and Dim 2 10.932% of the variance. The tables also show which categorical variables contribute the most to each dimension. The "Percentages" plot visualizes this.
The last two plots again highlight how the two variables "where" and "how" contribute to the analysis.
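As a hypothetical follow-up (assuming the `mca` object and the factoextra package loaded above), the contributions of individual variable categories to each dimension could also be inspected directly with `fviz_contrib`:

```r
# Sketch only: top 10 contributing categories per dimension
# (the dashed reference line marks the expected average contribution)
fviz_contrib(mca, choice = "var", axes = 1, top = 10)  # Dim 1
fviz_contrib(mca, choice = "var", axes = 2, top = 10)  # Dim 2
```

Categories rising clearly above the reference line drive the interpretation of that dimension, complementing the cos2-colored variable plot shown above.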
###FIN###